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SEMIPARAMETRIC RELATIVE-RISK REGRESSION FOR 
INFECTIOUS DISEASE DATA 

By Eben Kenah * 

Biostatistics Department and Emerging Pathogens Institute, 
University of Florida 

This paper introduces semiparametric relative-risk regression mod- 
els for infectious disease data based on contact intervals, where the 
contact interval from person i to person j is the time between the 
onset of infectiousness in i and infectious contact from i to j. The 
hazard of infectious contact from i to j is Ao(r)r(/3jXij), where Ao(t) 
is an unspecified baseline hazard function, r is a relative risk func- 
tion, Po is an unknown covariate vector, and Xij is a covariate vector. 
When who-infects-whom is observed, the Cox partial likelihood is a 
profile likelihood for /3 maximized over all possible Ao(t). When who- 
infects-whom is not observed, we use an EM algorithm to maximize 
the profile likelihood for /3 integrated over all possible combinations of 
who-infected-whom. This extends the most important class of regres- 
sion models in survival analysis to infectious disease epidemiology. 

1. Introduction. Infectious diseases remain an important threat to 
human health and commerce, and understanding the effects of covariates 
on infection transmission is crucial to the design of public health interven- 
tions. The statistical analysis of infectious disease data is complicated by the 
fact that the outcomes (infections) are inherently dependent (Becker, 1989; 
Andersson and Britton, 2000). This problem is especially pronounced for 
diseases transmitted directly from person to person, such as influenza and 
SARS. Epidemiologists have dealt with this problem in three ways. Most 
commonly, they model susceptibility to disease using standard statistical 
methods, such as logistic or Cox regression, that ignore this dependence. 
A second approach is to use discrete-time chain binomial models (Rampey 
et al., 1992) to estimate the probability of escaping infectious contact from 
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infected members of close-contact groups such as households, classrooms, or 
hospital wards. The third and most recent approach is to model the spread 
of disease as a branching process where infectees are the offspring of their 
infectors (Wallinga and Teunis, 2004; White and Pagano, 2008). The time 
intervals between the infection of an infector and the infection of his or her 
infectees are called generation intervals. Generation intervals in the branch- 
ing process are assumed to be independent and identically distributed (iid) 
samples from a known or estimated distribution. 

To understand transmission, it is crucial to separate the effects of co- 
variates on the risk of transmission (i.e., infectiousness and susceptibility) 
from their association with exposure to infected people (Rhodes, Halloran 
and Longini, 1996). Chronic-disease models cannot do this, inherently con- 
flating susceptibility and exposure. At the other extreme, generation and 
serial interval methods model the transmission of disease as a process that 
creates a population, not a process of spread through a preexisting popu- 
lation. Since infected people are the offspring of an infectious parent, these 
models ignore competing risks of infection from multiple infectors (Svens- 
son, 2007; Kenah, Lipsitch and Robins, 2008). Since the generation in- 
tervals are iid, these models force the implicit assumption of constant la- 
tent and infectious periods (Kenah, 2012) and cannot be extended eas- 
ily to model covariate effects. Discrete-time chain binomial models are a 
statistically sound response to the problem of dependence, and they sep- 
arate the effects of covariates on the risk of transmission from their as- 
sociation with exposure to infectious persons. However, their use is lim- 
ited in two ways: First, they are not implemented in standard statistical 
software — a problem solved partially by the publicly- available package Tran- 
Stat (www.epimodels.org/midas/transtat.do). Second, they force the use 
of discrete time. Since infectious disease data are usually recorded by the day 
or week, this is not unnatural. However, continuous-time models corrected 
for ties may offer a more natural and more flexible modeling framework. 

Kenah (2011) showed that parametric methods from survival analysis 
could be extended to analyze infectious disease data by modeling the con- 
tact interval. The contact interval Tij in the ordered pair ij is the time 
between the onset of infectiousness in i and the first infection contact from 
i to j, where infectious contact is defined as a contact sufficient to infect a 
susceptible individual. It is right-censored if the infectious period of i ends 
before i makes infectious contact with j or if j is infected by someone other 
than i. The distribution of Tij provides a concise summary of the evolution 
of infectiousness over time in person i because its hazard function equals the 
hazard of infectious contact with j. These methods solve the problem of de- 
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pendence by treating pairs of individuals, not the individuals themselves, as 
the units of analysis. Kenah (2012) showed that the contact interval distri- 
bution could be estimated nonparametrically by extending the Nelson- Aalen 
estimator from standard survival analysis. These methods assume a homo- 
geneous population which the contact interval distribution is the same for 
all ordered pairs ij where transmission from i to j is possible. As such, 
they are unable to address many important questions in infectious disease 
epidemiology. 

The goal of this paper is to extend the nonparametric estimators in Kenah 
(2012) to obtain a relative- risk regression model similar to that of Cox (1972) 
that will allow the semiparametric estimation of the effects of covariates on 
the hazard of infectious contact. For the ordered pair ij, the covariate vector 
can include infectiousness covariates for i, susceptibility covariates for j, and 
pairwise covariates. 

The rest of Section 1 reviews nonparametric estimation of the contact 
interval distribution, and Section 2 extends this to relative risk regression 
models. Our derivations are based on counting processes and martingales. 
Good introductions to these ideas are given in Kalbfleisch and Prentice 
(2002) and Aalen, Borgan and Gjessing (2009); Fleming and Harrington 
(1991) and Andersen et al. (1993) have more detailed discussions. Section 3 
explores the performance of the regression models in simulations, and Sec- 
tion 4 uses them to analyze data from Los Angeles County during the 2009 
influenza A(HINI) pandemic. Section 5 discusses the promise and peril of 
semiparametric relative risk regression in infectious disease epidemiology. 

1.1. Stochastic S(E)IR epidemic model. Consider a closed population of 
n individuals assigned indices 1 . . . n. Each individual is in one of four states: 
susceptible (S), exposed (E), infectious (I), or removed (R). Person i moves 
from S to E at his or her infection time ti, with tj = oo if i is never infected. 
After infection i has a latent period of length ej, during which he or she is 
infected but not infectious. At time ti + Ei, i moves from E to I, beginning 
an infectious period of length Li. At time ti + Si + Li, i moves from I to R. 
Once in R, i can no longer infect others or be infected. The latent period is 
a nonnegative random variable, the infectious is a strictly positive random 
variable, and both have finite mean and variance. 

An epidemic begins with one or more persons infected from outside the 
population, which we call imported infections. For simplicity, we assume that 
epidemics begin with one or more imported infections at time and there 
are no other imported infections. 

After becoming infectious at time ti+Si, person i makes infectious contact 
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with j ^ i at time Uj = t « + + T*j , where the infectious contact interval 
T*j is a strictly positive random variable with t*^ = cxd if infectious contact 
never occurs. Since infectious contact must occur while i is infectious or 
never, t*^ G (0, lj] or t*^ = oo. We define infectious contact to be sufficient 
to cause infection in a susceptible person, so tj < tij. 

For each ordered pair ij, let Cij = 1 if infectious contact from i to j 
is possible and Cij = otherwise. We assume that the infectious contact 
interval tIj is generated in the following way: A contact interval Tij is drawn 
from a distribution with hazard function \ij{T). If Tij < ii and Cij = 1, then 
T*j = Tij. Otherwise, T*j = oo. In this paper, we assume the contact intervals 
in all ordered pairs ij are independent and have finite mean and variance. 

Following Wallinga and Teunis (2004), let Vj denote the index of the 
person who infected person j, with Vj = for imported infections and Vj = 
oo for persons not infected at or before time T. The transmission network is 
the directed network with an edge from Vj to j for each j such that tj < T. It 
can be represented by a vector v = (ui, . . . , Vn)- Let Vj = {i : Cijli{tj) = 1} 
denote the set of possible infectors of person j, which we call the infectious 
set of person j. Let V denote the set of all possible v consistent with the 
observed data. A v G V can be generated by choosing a Vj G Vj for each 
non- imported infection j. 

Our population has size n, and we observe the times of all S — )• E (in- 
fection), E — )• I (onset of infectiousness), and I — )• R (removal) transitions 
in the population between time and time T. For all ordered pairs ij such 
that i is infected, we observe Cij. 

1.2. Censoring. We assume that we can observe Tij only if j is infected 
by i at time + + Tij . Clearly, Tij can be observed only if Cij = 1 . The 
following processes can right-censor Tij 

1. Xj(T) = 1^-^(0, t-] indicates whether i remains infectious at infectiousness 
age r. Thus, i makes infectious contact with j at infectiousness age Tij 
only if Xj(rij) = 1. 

2. Sijij) = lj.+e.+T-<fj. indicates whether j remains susceptible when i 
reaches infectiousness age r. Thus, j can be infected by i at time tij 
only if Sij{Tij) = 1. 

3. Assume that infection in person j can be observed until time Tj. Then 
3^jj(r) = lt^+£^+r<Tj indicates whether infection in j can be observed 
when i reaches infectiousness age r, so infectious contact from i to j 
can be observed at time tij only if yij{Tij) = 1. 
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Since liir), Sij{T), and yij^r) are left-continuous, 



(1) 



is a left-continuous process that indicates the risk of an observed infectious 
contact from i to j when i reaches infectiousness age r. 

The assumptions made in the stochastic S(E)IR model above ensure that 
Xj(r) and 5jj(r) independently censor Tjj. The methods in this paper also 
assume that yijir) independently censors Tij. When who-infects-whom is 
observed, Tij can be any stopping time with respect to the history gener- 
ated by Xj(T), Sij{T), and other processes that independently censor Tij. Our 
assumptions can be relaxed as long as independent censoring of Tij is pre- 
served (Kenah, 2012). When who-infected-whom is not observed, we require 
that yij{tj — ti — Ei) = 1 for all i £ Vj for each non-imported infection j. 

1.3. Nonparametric survival analysis of epidemic data. Assume there is 
a hazard function A(t) such that \ij{T) = A(r) for each ij such that Cij = 1. 
Let A(r) = J"J^ \{u)<iu be the corresponding cumulative hazard function. Ke- 
nah (2012) extended the Nelson- Aalen estimator to obtain a nonparametric 
marginal Nelson- Aalen estimator of A(r). This derivation used counting pro- 
cesses and martingales defined in infectiousness age. 

For each ordered pair ij, let Mij{T) = '^Tij<T indicate whether infectious 
contact from i to j occurs by infectiousness age r in person z, with Nij{T) = 
for all r if i is never infected. Then Mij is continuous from the right with 
left-hand limits (cadlag) and A/'ij(0) = 0, so 



is a mean-zero martingale. Since we can observe infectious contact from i to 
j only if j is susceptible and under observation. 



(2) 



M^j{T) =Nij{T) 



I 

Jo 



■T 



CijZi{u)X{u) du 



(3) 




counts observed infectious contacts from i to j and 




is a mean-zero martingale. 
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1.3.1. Who-infects-whom is observed. The number of contact intervals 
of length > r that were observed is 

n 

(5) Y{t) = Y^J2^,,{t), 

which is decreasing and left-continuous. When who-infects-whom is ob- 
served, we can calculate the Nelson-Aalen estimator 

where N{t) = J2]=i Ei^i For all r such that Y{t) > 0, this is an 

unbiased estimator of A(r) because 

(7) A{r)-A{r) = l^^^^dMiu), 

where M[t) = Ej^^ Mjj(r) is a mean-zero martingale. When the 

contact interval distribution is continuous, the variance of A(r) — A(t) can 
be estimated using its optional variation process 

(8) <,\r) = [^,mu). 

1.3.2. Who-infects-whom is not observed. When who-infected-whom is 
not observed, we cannot calculate the Nelson-Aalen estimate because we 
do not know which contact intervals are censored and which are observed. 
Fortunately, 

(9) A(r)=E[A(r)] = E[E[A(t) | observed data] ] 

by the law of iterated expectation, so we can estimate A(r) by estimating 
the mean of the possible Nelson-Aalen estimates. When the contact interval 
distribution is continuous, the probability that j was infected by person i 
given the observed history up to time tj is 



(10) Pij{\) = = — f- 

l^keVj ^kj\S-j -i-k- £k) 

Since the infector of each infected j can be chosen independently given 
the observed data (Kenah, Lipsitch and Robins, 2008), the probability of 

V = {vi, . ..,Vn) is 

(11) Pr(v|observed data) = Pvjj- 

j: 0<Vj<oo 
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Let Av(r) denote the A(t) that we would have calculated had we observed 
the transmission network v. Then 

(12) A(t) = ^ Av(r) Pr(v|observed data) 

vev 

is an unbiased estimate of A(r) for ah r such that Y(t) > 0. We call this 
the marginal Nelson- Aalen estimate. 

Since the true hazard function A(r) is unknown, we cannot use equa- 
tion (12) directly. Instead, we use it as part of an EM algorithm that starts 
from an initial guess at the hazard function. Given a hazard function A^'^^(r), 
let 

(13) A^-.(t|A('=)) = Y.p.,{X('^'>)U>t,-u-e.. 

Then the marginal Nelson- Aalen estimate given IS 

(14) A(-|A('=))=j['^diV(u|A('=)), 

where iV(r|AW) = Ej=i ^-i (^| A^''^) • We can smooth the increments of 
^(rlAC^)) to estimate a new hazard function A*-^'*'^\ and so on. Iterating 
from an initial X^^\t) leads to Algorithm 1, which turns out to be EM 
algorithm. The limit of the sequence A('=)(r) is the marginal Nelson-Aalen 
estimate A(r) (Kenah, 2012). 



Choose an initial A^°^(r); 
Set k = 0- 

while convergence critenon not met do 

E-step: Calculate infector probabilities Pij(A'*-'); 
M-step: Calculate A^''\t) = A(r|A''='); 
Smoothing step: Smooth A''°^(t) to obtain A*'""''^' (r); 
Set fc = + 1; 

end 

Algorithm 1: EM algorithm for nonparametric estimation of A(r) using 
data from a homogeneous population. 

The variance of A(r) can be estimated using the conditional variance 
formula. Conditioning on the transmission network v, we have 



(15) 



?2(r) =E[a2(r)] +Var(Av(r)), 
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where Av(t) is the Nelson- Aalen estimate from equation (6) and d'^ir) is 
the variance estimate from equation (38) that we would have calculated had 
we observed the transmission network v. This reduces to (Kenah, 2012) 



2. Methods. The methods of Section 1.3 assume a homogeneous pop- 
ulation in the sense that Aij(r) is the same for all ij such that Cij = 1. Now 
consider a semiparametric relative-risk model like that of Prentice and Self 
(1983) in which 



where Ao(t) is an unspecified baseline hazard function, r : M — )• (0, oo) is a 
relative risk function, /3o is an unknown 6x1 coefficient vector, and Xjj(r) 
is a 6 X 1 predictable covariate process taking values in a set X. We assume 
that r has continuous first and second derivatives, r(0) = 1, and lnr(/3"'"X) 
is bounded on X. Letting r(x) = exp(x) gives us a loglinear relative risk 
regression model like that of Cox (1972), and letting r(x) = 1 + x gives us 
a linear relative risk regression model. 

To fit these semiparametric models, we adapt the nonparametric estima- 
tors from Kenah (2012) to account for the relative risk function. First, we 
consider the case where who-infects-whom is observed. Then we describe an 
EM algorithm to handle the case where who-infects-whom is not observed. 

2.1. Who-infects-whom is observed. Let Ao(r) = Ao(^f)dn. For a given 
/3, the Breslow estimator (Breslow, 1972) of Aq{t) is 



(17) 



A.i(T) = r(/3o^X,,(r))Ao(r) 



(18) 




where 



n 



(19) 




i=i i¥=j 



This estimator has two desirable properties. First, A(/3o,t) is an unbiased 
estimator of Aq{t). For all r such that Y(t) > 0, 
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where 

(21) M(/3o, r) = N{t) - T Y{(3o, n)Ao(u) du. 

Jo 

is a mean-zero martingale. Second, A(/3, r) maximizes the log likelihood 

(22) £(/3,A) = X;El'^(K/3^^-..K.))dA(T,^.,))- ry(/3,u)dA(n), 

over all step functions A(r). Substituting A(/3, r) into £(/?, A), we get the log 
profile likelihood 

where T = max{r : Y{t) > 0}. The first term is similar to the log partial 
likelihood from Cox (1972) and the second term does not depend on /3. 
Dropping the second term, let 

(24, pm-ty-^^^ 

be the log partial likelihood for (3. This derivation of the partial likelihood as 
a profile likelihood follows that of Johansen (1983). Let /3 denote the value 
of f3 that maximizes pl{l3), and let Ao(t) = A(/3, r) denote the corresponding 
Breslow estimate of the baseline cumulative hazard. 

2.1.1. Partial likelihood score process. We can rewrite pl{(3) as a sum of 
stochastic integrals: 

The corresponding score process is 

(26) Ui(3,T) = ^Yll Qo^^r{^''X,,iu))-Ei(3,u)dN,,iu), 
j=l i^j 

where 

, _ E"=iE.^,K/3^^..(^))^..(^)ilnr-(/3-^X.,(^)) 
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is the expected value of ^ hir(^j3^ Xij{uj) over the risk set at u when each 
pair is weighted by its hazard of transmission at u. By the Doob-Meyer 
decomposition, there is a mean-zero martingale Mij{u) for each ij such that 

(28) dNij{u) = r 

Expanding equation (26) using this decomposition and simplifying, we get 

Since it is a sum of integrals of predictable processes with respect to mar- 
tingales, ?7(/3o,r) is a mean-zero martingale. 

2.1.2. Observed and expected information. Since the Nij{T) do not jump 
simultaneously in continuous time, the predictable variation process of U{Po, r) 
is 



(30) (f/(/3o))(r) = / V{(3o,u)Y{f3o,u)Xo{u)du, 

where 







^) = E E 1- YW,u) ) 



d , r{f5-'X,,{u))Y^r{f3^X,,{u))Y,{u) 



y(/3,n) 



is the variance of ^ lnr(/3"'"Xjj(u)) over the risk set at u when each pair ij 
is weighted by its hazard of transmission at u. 

Let I{/3) = —-^pl{(3) be the observed information. Then 

where v^"^ = vv'^ for a column vector v [v"^ for scalar v). Expanding /(/3o) 
via the Doob-Meyer decomposition (28) and simplifying, we get 



oo 



I{M = / Vi{3o,u)Y{l3o,u)Xo{u)du 
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The second term has expectation zero, so /(/3o) is an unbiased estimate of 
the variance of C/(/3o,oo). 

Another estimate of Var ([/ (/3o, oo)) is obtained by substituting the incre- 
ments of the Breslow estimator (18) for Ao(u)dM in equation (30). This gives 
us the (estimated) expected information 

POO 

(34) X(/3) = / V{/3,u)dN{u). 

Jo 

Expanding X[(3q) using the Doob-Meyer decomposition and simphfying, we 
get 

/•oo /-oo 

(35) X(/3o)= / V{po.u)Y{^Q,u)Xo{u)du+ y(/3o, n) dM(n). 

Jo Jo 

The second term has expectation zero, so X(/3o) is also an unbiased es- 
timate of the variance of U{f3o,oo). I{f3o) rnay be a better estimator of 
Var(f7(/3o, oo)) than I(/3o) because it is guaranteed to be positive semidef- 
inite (Prentice and Self, 1983) and it depends only on aggregates over risk 
sets (Aalen, Borgan and Gjessing, 2009). 

When r(x) = as in the Cox model, lnr(/3"''X) = for all /3 and X 

so 

(36) I {(3) = ^lny(/3,u) diV(n). 
Since 

for all /3 and X, we have ^lny(/3,u) = V{(3,u). Therefore, /(/3) = I{/3) 
for all (3. For general r (/3"'"X) , I{(3q) and X(/3o) are asymptotically equivalent 
under weak regularity conditions (see Appendix A). 

2.1.3. Large-sample estimation of (3q and Kq[t). Appendix A outlines 
sufficient conditions for the asymptotic normality of U{I3q,t) and /? as m — )• 
oo, where m is the number of pairs ij at risk of transmission. These are 
the same conditions required for asymptotic normality in standard survival 
data, except for the requirement that m is much larger than the largest 
number of infectors to which any single susceptible is exposed. Under these 
conditions, hypothesis tests and confidence intervals for (3q can be obtained 
using score, Wald, or likelihood ratio statistics. 
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Given /3, the Breslow estimator of Ao(r) is Ao(t) = Ao(/3,r). Its variance 
is consistently estimated by 

(38) aUr) = (|a(A (|A(ft .)) + [ ^ dA.(„), 

which is derived in Appendix B.l. can be replaced by X(/3). Using 

the martingale central limit theorem and a log transformation, we get the 
approximate pointwise 1 — a confidence limits 

,30, A„M..p(±?|l.-(i-|)). 

Point and interval estimates for the baseline survival function can be ob- 
tained using the product integral (Aalen, Borgan and Gjessing, 2009) or 
using So{t) = exp ( — Ao(t)). These estimates are asymptotically equiv- 
alent, but the latter is more consistent with the derivation of the partial 
likelihood as a profile likelihood. 

2.2. Who-infects-whom is not observed. If we observe infection but not 
who-infects-whom, we cannot calculate the partial likelihood pl{l3) or the 
Breslow estimate A(/3, r) because we do not know which contact intervals 
are observed and which are censored. However, we can use an EM algorithm 
similar to that of Kenah (2012) to obtain consistent and asymptotically 
normal estimates of /3o and Ao(r). 

Given a coefficient vector /3 and a baseline hazard function A(r), we can 
calculate Pr(v|observed data) for each v G V (Kenah, Lipsitch and Robins, 
2008). If j is infected at time tj, the probability that j was infected by i 
given /3 and A(r) is 

(40) p,-(/3. A) - '^^'""'^^'^ - - ''^^^^'^ - - '^^'"^^ 



Y.k&v, r{P'^Xkj{tj -tk- ek))Xitj -tk- Ek) ■ 

The infectors of different infected persons can be chosen independently, so 
the probability of a transmission network v = {vi, . . . ,Vn) given /3, A(r), 
and the observed data is 

(41) Pr(v|/3, A, observed data) = J| Pi,,j(/3, A). 

j: 0<Vj<oo 



Note that these equations assume a continuous contact interval distribution, 
so simultaneous infectious contacts have probability zero. 
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Let plyf{(3) be the log partial likelihood that we would have calculated had 
we observed the transmission network v. Given a coefficient vector /3* and 
a baseline hazard function A*(r), the expected log likelihood is 

^p/v(/3) Pr(v|/3*, A*, observed data) 
vev 

where %(r|/3*,A*) = pij{(3* , X*)lr>tj-u-e,- Now let iV(r|v) be the value 
of N{t) that we would have calculated had we observed the transmission 
network v and let the corresponding Breslow estimate be 

(43) Uf,,r)^[^^m(uW). 

Then the the marginal Breslow estimate given /3* and A*(r) is 

V,A-(/3,-^) = ^Av(,3,r)Pr(v|/3*, A*, observed data) 
vev 

(44) .^^_l_d/VW-,A-), 

where N(t\D-, A") = EJ.i E.^, JVy(r|/f, A*). 

For the relative risk function r(x) = e^, the expected log partial likeli- 
hood plp* x*{P) is the log partial likelihood of a weighted Cox regression 
model (Therneau and Grambsch, 2000) with two copies of each pair ij: 
an uncensored copy with weight Pij{/3* , A*) and a censored copy with weight 
1 —pij{f3* , A*). The baseline hazard estimate from this model is the marginal 
Breslow estimate Aj3*^x*{(3,T), where /3 = argmax^pZ^* (/3). 

2.2.1. EM algorithm. When who-infects- whom is not observed, the semi- 
parametric regression model can be fit using the ECM algorithm of Meng 
and Rubin (1993), which is an extension of the EM algorithm of Dempster, 
Laird and Rubin (1977). In each iteration, we first estimate /3q using the 
expected log partial likelihood and then calculate the marginal Breslow es- 
timator of Ao(t). We then use these new estimates to re- weight the possible 
V. The entire process is described in Algorithm 2. 

To show that this is an ECM algorithm, we must show that the CMl and 
CM2 steps are conditional maximizations of the expected log likelihood. 
Since the CMl step is a conditional maximization by definition, it remains 



(42) 
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Choose an initial ^(f' and A('"(r); 
Set = 0; 

while convergence criterion not met do 

E-step: Calculate infector probabilities pij (/3'*' , A^''') ; 
CMl-step: Find /3(*+i) = argmax/3 p/^ct) ^^ct) (/3) ; 
CM2-3tep: Calculate A'^+^'M = A^(fe) (/J^'^+i' , r) ; 
Smoothing step: Smooth A''°+^^(r) to obtain A''=+^'(r); 
Set fe = + 1; 

end 



Algorithm 2: ECM algorithm for semiparametric estimation of /3o and 
Ao(t) in a heterogeneous population. 



to show that the CM2 step is a conditional maximization. Given a coefficient 
vector (3* and a hazard function A*, the expected log likelihood is 

n 

V,A*(/3, A) = J] A*) In (^r{l3^ Xij{t, - U - e^)) dA{tj -t,- e,) 



(45) - / Y{P,u)dA{u). 

Jo 

Differentiating with respect to dA{tj —ti — Ei) for each i and j shows that, for 
a fixed /3, ^/3*^a* A) is maximized over all step functions A(r) by setting 

exactly as in the marginal Breslow estimator A^*^a* (/3) ''")• Therefore, Algo- 
rithm 2 is an ECM algorithm. When it is known that /3 = 0, it reduces to 
Algorithm 1, which shows that convergence of both /S^'^) and A^^\t) should 
be monitored to ensure convergence of the ECM algorithm. 

2.2.2. Large-sample estimation of (3q. Let /3 denote the estimate of /3o to 
which the ECM algorithm converges, and let A(r) denote the corresponding 
estimate of Ao(t). Let C/v(t, /3) and Iv(/5) denote the score and the observed 
information that we would have calculated had we observed the transmission 
network v. Using the methods of Louis (1982), the observed information is 

(47) m = E~^~^ [ U0) ] - E~^~^ [ U^il oo)®2 ] ^ 

where E^ ;j[-] denotes an expectation taken under the assumption that the 
true coefficient vector is /3 and the true baseline hazard function is A(r). 
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The first term in (47) is 



where Nij{u) = Nij{u\f3, X). This is the observed information matrix from 
a weighted regression model where each ij has an uncensored copy with 
weight Pij{l3, A) and a censored copy with weight 1 — Pij{f3, A). To evaluate 
the second term in (47), let 

be the expected score contribution from individual j as a susceptible. Then 

E^^5;[[/(/3,oo)®2] is 

because X]j=i ^-jiP^ °o) = 0, each infected person j has only one infector in 
any v, and the infectors of different individuals can be chosen independently. 

2.2.3. Large-sample estimation o/Ao(r). Let Ao(t) be the marginal Bres- 
low estimate obtained after convergence of the ECM algorithm. Its variance 
is consistently estimated by 

(51) 5g(r) = (^^~a(^'-)) V)-'(^^~a(^'-)) 

(52) + 2 r dN{u) -y(r dN.jiu)] , 

Jo y(/3,tx)2 Vio y(/3,n) 7 

where N.j{u) = Yli^^j (s^e Appendix B.2). Using the martingale cen- 

tral limit theorem and a log transformation, we get the approximate point- 
wise 1 — a confidence limits 

,53, A„Me.p(±|^.-(l-|)). 

As before, point and interval estimates for the baseline survival function can 
be obtained using the product integral (Aalen, Borgan and Gjessing, 2009; 
Kenah, 2012) or using Soir) = exp ( - Ao(t)). 
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3. Simulations. The performance of the methods from section 2 was 
tested with a series of 12000 network-based epidemic simulations. All epi- 
demics took place on a Watts-Strogatz small-world network (Watts and 
Strogatz, 1998), which mimics the high clustering and low diameter of real 
human contact networks. Starting with a ring of 50000 nodes, each node was 
connected to its 10 nearest neighbors and each edge was rewired to a ran- 
domly chosen node with probability 0.1. A new contact network was built 
for each simulation. 

All epidemic models were written in Python 2.7 (www.python.org) using 
the packages NetworkX 1.6 (networkx.lanl.gov), NumPy 1.6, and SciPy 
0.9 (www. scipy . org). Statistical analysis was done in in R 2.15 (www. r-project . org) 
via the Rpy2 2.2 package (rpy . sourcef orge .net). The code for the models 
is available as Online Supplementary Information. 

3.1. Transmission model. The transmission model had a latent period of 
zero and an exponential infectious period with mean one. The baseline con- 
tact interval distribution was Weibull(a, 7), where a is the shape parameter 
and 7 is the rate parameter. 6000 simulations had a Weibull(0.5, 0.2) distri- 
bution, which has Ao(r) = (0.2r)°-5. The other 6000 had a Weibull(2, 0.6) 
distribution, which has Ao(r) = (0.6r)^. These distributions gave i2o ~ 3 in 
a null model. 

In the transmission model, each person i had an infectiousness covariate 
Xf^^ and a susceptibility covariate X?^. Each pair ij connected by an edge 
had a pairwise covariate XfJ^^^. All covariates were independent Bernoulli(.5) 
random variables. For a connected pair ij, the hazard of transmission from 
i to j at infectiousness age r of i was 

(54) \j (r) = exp ( AnfXff + /3sus^r + /^pair^f/") Ao (r ) 

For each parameter /3, there were 4000 simulations where its true value was 
chosen from a uniform distribution on (—1,1). Of these, 2000 simulations 
used the Weibull(0.5, 0.2) basehne hazard and 2000 used the Weibull(2, 0.6) 
baseline hazard. Of the 2000 simulations for each baseline hazard, 1000 had 
the other two /3 set to and 1000 had the other two f3 set to 1. 

Each simulated epidemic began with a single person infected at time 0. 
Data from the next 1000 infections was used to fit two regression models, one 
using information on who-infected-whom as in Section 2.1 and one using an 
EM algorithm as in Section 2.2. The EM algorithm used a minimum of 2 and 
a maximum of 25 iterations. At each iteration, a weighted Cox model was 
run using the last parameter estimates as the initial parameter estimate. 
Convergence was defined as a change less than 0.002 in the expected log 
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likelihood (tighter convergence criteria yielded nearly identical parameter 
estimates). After convergence, a Cox model was run using the final weights 
and initial parameters /3inf = /3sus = /5pair = 0. 

After each simulation, we recorded true values, estimates, and confidence 
intervals for each (3 in the model and baseline hazard estimates and con- 
fidence intervals at the 10*'', 25*'', 50*'', 75*'', and 90*'' percentiles of aU 
possible (censored and uncensored) contact intervals. We also recorded the 
a and 7 of the baseline hazard function and the number of EM iterations. 

3.2. Results. Figure 1 shows good agreement between the estimated and 
true /3inf, /3sus) and /3pair for both /3 and /3. Table 1 shows 95% confidence 
interval coverage probabilities above .91 for all combinations of baseline 
hazards and parameters. The lower right panel of Figure 1 shows that this 
was achieved with relatively few iterations. The median number of iterations 
was 6, 98% of simulations required < 10 iterations, and only 3 out of 12000 
simulations failed to converge. 

Figures 2 and 3 show good agreement between the estimated and true 
baseline hazard for both Ao(t) and Ao(t). The smoothed means show almost 
no bias in Ao(t) or Ao(r) for a = .5 and a slight upward bias at high r for 
a = 2. Table 2 shows good 95% confidence interval coverage probabilites for 
the baseline hazard with shape parameter a = .5 but much poorer coverage 
probabilities for the baseline hazard with a = 2. When a = 2, the baseline 
hazard function is changing fastest at high r, where there is the least data. 
Also, the estimated Ao(t) and its confidence limits were evaluated as step 
functions; coverage probabilities may have been higher had smoothing or 
interpolation been used. 

Figure 4 shows the widths of confidence intervals for /?inf versus /3inf, /3sus 
versus /3sus, /3pair versus /3pair, and Ao(r) versus Ao(t). Knowledge of who- 
infects-whom improves the precision of j3\ni and /3pair estimates but not /3sus 
estimates; it slightly improves the precision of Ao(t) estimates. The baseline 
hazard plays an important role in determining how much precision is gained, 
with a larger gain for a = 0.5 than for a = 2. The confidence intervals for 
Anf) /3sus, /3pair> and Ao(r) have slightly lower coverage probabilities than 
those for Anf) Aus, /3pair; and Ao(r) (see Tables 1 and 2), so these plots 
underestimate the true precision gained when who-infects-whom is observed. 

Knowledge of who-infected-whom allows point estimates that are closer to 
the truth and interval estimates with better coverage probabilities. However, 
it is remarkable how much information can be recovered by the EM algorithm 
when who-infected-whom is not observed, making the iterative regression 
model of Section 2.2 a promising tool for infectious disease epidemiology. 
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4. Data Analysis. To show how the methods of Section 2 can be ap- 
phed, we will look at the effect of antiviral prophylaxis and age on the 
transmission of pandemic influenza A(HINI) in Los Angeles County in 2009. 
The Los Angeles County Department of Public Health (LACDPH) collected 
household surveillance data between April 22 and May 19, 2009 according 
to the following protocol (Sugimoto et al., 2011): 

1. Nasopharyngeal swabs and aspirates were taken from individuals who 
reported to the LACDPH or other health care providers with acute 
febrile respiratory illness (AFRI), defined as a fever > 100°F plus 
cough, core throat, or runny nose. These specimens were tested for 
influenza, and the age, gender, and symptom onset date of the AFRI 
patient were recorded. 

2. Patients whose specimens tested positive for pandemic influenza A(HINI) 
or for influenza A of undetermined subtype were enrolled as index 
cases. Each of them was given a structured phone interview to collect 
the following information about his or her household contacts: age, 
gender, type of contact (household, intimate, in-home daycare, non- 
home daycare), and high risk status (pregnant, child on long-term 
aspirin therapy, immunosuppressed, or history of a chronic cardiac, 
pulmonary, renal, liver, or neurologic condition). The interviewer also 
recorded whether prophylactic antiviral medication was being taken 
by the household contacts. They were asked to report the symptom 
onset date of any AFRI episodes among their household contacts. 

3. When necessary, a follow-up interview was given 14 days after the 
symptom onset date of the index case to assess whether any additional 
AFRI episodes had occurred in the household, including their illness 
onset date. 

There were 58 households with a total of 299 members. There were 99 infec- 
tions, of whom 62 were index cases (4 of the 58 households had co-primary 
cases) and 27 were household contacts with an AFRI. For simplicity, we as- 
sume these were all influenza A(HINI) cases and that all household members 
were susceptible to infection. 

Our natural history assumptions are adapted from Yang et al. (2009) and 
identical to those in Kenah (2012). In the primary analysis, we assume an 
incubation period of 2 days, a latent period of days, and an infectious 
period of 6 days. Under these assumptions, a person j with symptom onset 
at time was infected at time tj = t^^™ — 2 and will stop being infectious 
at time tj + 6 = i^-^™ -|- 4. Under these assumptions, person j can transmit 
infection on days tj -|- 1 to tj -|- 6. In a sensitivity analysis, we vary the latent 
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period from to 1 days, and the infectious period from 5 to 7 days. 

Here, we use the regression model of Section 2.2 to estimate the in- 
fluenza transmission hazard ratios for age in the infectious and the sus- 
ceptible and the hazard ratio for antiviral prophylaxis in the susceptible. 
We then estimate transmission probabilities for different combinations of 
covariates in infectious/susceptible pairs. The variables in the regression 
models are: ageinf = if the infectious person is < 18 years old and 1 oth- 
erwise, agCsus = if the susceptible is < 18 years old and 1 otherwise, and 
prophsus = if the susceptible is not on antiviral prophylaxis and 1 other- 
wise. Since antiviral prophylaxis was initiated after the initial case in each 
household, it was considered only as a susceptibility covariate. All statistical 
analysis was done in R 2.15 (www.r-project.org). 

4.1. Results. There were 114 people aged < 18 years and 185 aged > 18 
years, with no missing age data. There were 91 people taking antiviral pro- 
phylaxis and 152 not taking prophylaxis, with missing prophylaxis data for 
56 people. When who-infects-whom is not observed, a complete-case anal- 
ysis requires the removal of all rows corresponding to infectious-susceptible 
pairs ij where i S Vj and any member of Vj is missing data. Otherwise, the 
remaining members of Vj get too much credit for the infection of j. 

In the main analysis, there were 70 people infected from outside the house- 
hold (i.e., no possible infector in the household), 16 with 1 possible infector, 
7 with 2 possible infectors, 4 with 4 possible infectors, and 2 with 8 possible 
infectors, giving us 1^^ x 2^ x 4^ x 8^ = 2097152 possible transmission trees. 
The pairwise data contains 443 infectious-susceptible pairs with a total of 
2455 pair-days at risk of infection. Of these, 16x1 + 7x2 + 4x4 + 2x8 = 62 
rows represent possible infection events. All models used the Efron approx- 
imation for the partial likelihood with tied failure times. 

The top panel of Table 3 shows the results of seven models. All of the 
models including prophylaxis suggested that antiviral prophylaxis reduced 
the hazard of transmission by about 60%, with low p-values. Multivariable 
and stratified models with interaction suggest a stronger effect of antiviral 
prophylaxis on transmission to and from adults than on transmission to and 
from children. However, the interaction term coefficients had high p-values 
and wide confidence intervals (not shown). In all models, adults appeared 
more infectious and less susceptible than children. However, the coefficients 
for the main effect of age also had high p-values and wide confidence inter- 
vals. The bottom panel of Table 3 shows the results of a sensitivity analysis 
with the multivariable model without interaction. Varying the latent and 
infectious periods has relatively little effect on the results of the model. 
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Figure 5 shows estimates of the cumulative transmission probabihty based 
on the multivariable and stratified models without interaction. The results of 
the two models are similar, but the stratified model generally showed slightly 
lower probabilities of transmission from children and higher probabilities 
of transmission from adults than the multivariable model. All four panels 
clearly show the estimtated effect of antiviral prophylaxis. Comparing the 
top and bottom rows shows that children are estimated to be less infectious 
than adults. Comparing the left and right columns shows that children are 
estimated to be more susceptible than adults. All curves show bigger jumps 
on the first four days after infection than on days 5 and 6, which is consistent 
with the results of Kenah (2012). 

This data analysis has been intended primarily to illustrate the flexibility 
of the regression modeling framework for the analysis transmission data. 
There are several important limitations of the analysis itself. The data set is 
not large, so there is limited power to estimate the effects of age and antiviral 
prophylaxis. The age classification is crude, so it may not accurately capture 
the true effects of age. The prophylaxis variable was missing for many pairs 
and it was modeled as a binary variable, allowing no consideration of the 
timing of prophylaxis relative to exposure. Earlier analyses of household 
transmission of influenza A(H3N2) found greater child-to-child than adult- 
to-adult transmission (Addy, Longini and Haber, 1991). In our analysis of 
influenza A(HINI), we found that children are less infectious and more 
susceptible than adults. This could be a difference between the H3N2 and 
HlNl subtypes of influenza A, or it could be a bias caused the failure to 
account for infection from outside the household. In either case, this analysis 
shows the need for two important extensions to the modeling framework: 
The ability to handle missing data more flexibly and the ability to model 
infection from outside the household. 

5. Discussion. Compared to the discrete-time chain binomial model 
of Rampey et al. (1992), the regression model framework proposed here has 
several advantages. It can be fit using standard statistical software in a way 
that resembles a standard regression model. It offers all of the modeling tools 
available in a multivariable Cox regression framework, such as stratification 
and interaction. Standard software can be used to convert the results into 
curves representing the cumulative probability of transmission in pairs of 
individuals with specific characteristics. This ease of use will encourage the 
adoption of these methods in biomedical and public health research. 

There are two immediate extensions that will be required before the 
relative-risk regression models presented here can become truly useful tools 
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in infectious disease epidemiology. First, we must be able to simultaneously 
model the process of infection from outside the household and transmission 
within the household. The discrete-time chain binomial model can include 
a per-time-unit probability of escaping infection from outside the house- 
hold. In the iterative regression model, this could be achieved by fitting 
two models in each step of the EM algorithm: a pairwise contact interval 
model within the household and an individual-level absolute-time model for 
infection from outside the household. In iteration k of the EM algorithm, 

(k) 

an individual j who got infected would have a probability Pq^ that he or 
she was infected from outside the household. The weights of the possible 

(k) 

infectors within the household would add up to 1 — Pqj ■ At each step, the 
weights would be recalculated based on covariates, coefficient estimates, the 
baseline hazard of the contact interval distribution, and the baseline hazard 
of infection from outside the household. Second, we must be able to handle 
missing data flexibly but rigorously. Missing data on infection times, latent 
periods, and infectious periods is the rule, not the exception, in infectious 
disease epidemiology. For simple missing data (such as the missing data on 
antiviral prophylaxis in Section 4), the EM algorithm could be extended to 
calculate the expected log likelihood over the possible values of the miss- 
ing data as well as who-infected-whom. A more general solution for missing 
data, especially missing infection and removal times, would be to use a profile 
sampler (Lee, Kosorok and Fine, 2005) for the model coefficients, treating 
the baseline hazards as a nuisance parameter. 

Other extenions of the theory and methods presented here would make 
the regression framework presented here more broadly applicable. The SEIR 
framework is best suited to acute, immunizing diseases that spread directly 
from person to person. Many foodborne and waterborne diseases, pneumo- 
coccal and meningococcal diseases, and other infectious diseases of major 
public health importance do not fit easily into this framework. The first 
limitation could be addressed by allowing individuals to experience multiple 
events (first infection, the second infection, etc.) and allowing individuals 
to experience different types of events (new carriage, new infection, relapse, 
etc.). In this paper, we assumed that contact intervals are independent of 
infectious periods. In some cases, there may be a covariate process X{t) 
such that liir) and Nij{T) are independent given X{t^). If not, infectious 
contact and the infectious period could be modeled as a multivariate survival 
process. The flexibility of the theory of counting processes and martingales 
will be valuable in extending the model to more complex diseases. 

Finally, there are technical issues that deserve more study. The smoothing 
step is crucial to the iterative regression model. Here, we used cubic smooth- 
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ing splines because they were convenient and worked well. However, these do 
not guarantee that the smoothed hazard function is monotonically increas- 
ing and do not have a convenient interpretation in terms of the likelihood. 
A penalized likelihood estimator that guarantees monotonicity (Anderson 
and Senthilselvan, 1980) would be more consistent with the theoretical jus- 
tification of the EM algorithm. A more careful study of the asymptotics of 
this model would also be useful, especially as the model is extended to more 
complex applications. Since most infectious disease data is discrete (by day, 
by week, etc.), a detailed comparison of regression models with correction 
for ties versus the discrete-time chain-binomial model is important. 

Despite these limitations, semiparametric relative-risk regression is a pow- 
erful new framework for the analysis of infectious disease data. Its flexibility 
will allow statistical methods in infectious disease epidemiology to develop 
in concert with advances in molecular biology. Since all calculations in the 
EM algorithm are sums over possible combinations of who-infected whom, 
phylogenetic data can be incorporated directly by restricting the sums to 
transmission trees compatible with the phylogenetic tree. By placing the 
analysis of infectious disease data on the theoretical foundation of survival 
analysis, this approach may help clarify causal inference in infectious disease 
epidemiology, allowing better design of observational studies and interven- 
tion trials. Statistical methods that help improve the response to emerging 
or re-emerging infections could protect human health and commerce from 
unknown but possibly tremendous dangers. 
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APPENDIX A: CONSISTENCY AND ASYMPTOTIC NORMALITY 

The conditions for the consistency and asymptotic normality of /3 and 
Ao(t) in the Cox model were given in Andersen and Gill (1982), which used 
martingales to simplify and generalize the asymptotic results of Cox (1975) 
and Tsiatis (1981). Conditions for the more general relative risk model were 
given in Prentice and Self (1983). Here, we outline the most important of 
these conditions and point out their implications for the use of relative risk 
regression models in infectious disease epidemiology. 

A.l. Regularity conditions. Assume all observations take place at 
infectiousness ages in [0, T] for some finite T- Let m = Y{0^) = limi-^o ^i'^) 
be the number of pairs ij that were at risk of infectious contact from i to 
j while under observation. Let rim denote the number of individuals that 
constitute the m pairs. Define the following functions (Prentice and Self, 
1983): 

S^^\P,t) = -Yil3,r) = j;r(/3"rX.,(r))y.,(r), 
III' 

Sl\fi.T) = ^.St\r) = -5]x.,(r)r'(/3"^X.,(r))y.,(T), and 

Sg\^,T) = - J]^X.,(r)®2((lnr)'(/3Tx.,(r))) r^fi' X.,,{t))Y,,{t). 
Ill' . -. ., . ^ ' 



Note that is real-valued, is 6 x 1 vector-valued, and is b x b 
matrix- valued. Now let 

(55) Em{f3,r) = ^^^^ and 

(56) K.(/3,r) = 4!^-^".(/3,^r 
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be the values of E{(3,t) and y(/3,r), respectively, based on observations of 
m pairs at risk of transmission. 

For consistency and asymptotic normality of y/rn{(3 — (Hq), we have the 
following sufficient conditions (Andersen and Gill, 1982; Prentice and Self, 
1983): 

A. (Finite interval) Ao(7~) < oo. 

B. (Regression function positivity) There exists a neighborhood Bq of 
such that r(/3"'"Xjj(T)) is locally bounded away from zero for all ij and 
all P G Bo. 

C. (Asymptotic stability) There exists a neighborhood B Bq of (3q and 
functions s^^\ s^^\ s^'^^ defined on ;B x [0,7^ such that 



(57) sup \\S^J;){f],r)-s'^'^\f3,T)\\^0asm 



oo 



/3eB,re[o,T] 



for k = 0,1,2. Here, ||x|| is |x| for real x, max (|xi|, . . . , for vector 
X, and max (|a;ii|, . . . , \xhi,\) for matrix x. Asymptotic properties of the 
Cox model depend only on convergence of these three functions. For more 
general relative risk functions, convergence of four additional functions 
is also required (Prentice and Self, 1983). 

D. (Asymptotic regularity) The functions s^^^ {I3,t), . . . , s^'^^ (/3, r) are bounded 
on B X [0,T] and continuous in /3 uniformly in r. In addition, s^^^ is 
bounded away from zero and has first and second derivatives with re- 
spect to ^ on ;B X [0,7^. Finally, let e(/3,r) = ^[o|||^'^| and v{J3,t) = 

Sf;^-e(/3,r)«^Then 

rT 

(58) S= / ^;(/3o,n)sW(/3o,n)Ao(n)du 

Jo 

is positive definite. 

E. (Asymptotic stability of the observed information matrix) 

(59) 

sup / — VV||X,,(tx)f ((lnr)"(/3Tx,,(tx))) r(/3o^X,,(n)) Ao(u)dn 

Ao. 

F. (Lindeberg condition) 

p 



(60) ^sup X,j{T)ilnry{l3^Xij{T)) 



0, 
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where the supremum is over all r G [0, T] and all ij such that l^j(r) = 1. 

Condition F is automatically fulfilled if the covariates Xij are bounded. 
In the Cox model, conditions B and E are automatically fulfilled because 
exp(x) > and (lnr)"(x) = for all real x. With only slight modification, 
these conditions also guarantee consistency and asymptotic normality in a 
stratified relative-risk regression model (Andersen and Borgan, 1985). 

For the methods in this paper, the most important constraint is that 
s*-*^) (/3, r) is bounded away from zero. This has two implications for infectious 
disease data that have no counterpart in standard survival data. First, the 
infectious period must be > T with positive probability. Second, the number 
of infectors to which the susceptible j in a randomly chosen pair ij at risk 
of transmission is exposed must have a finite mean as m — t- oo. Let 

(61) Y.,{t) = Y,Yvir) ^Y.^.j{0+) = m. 

Now let Dij = Y.j{0^)Yij(0^) be the number of infectors to which j is 
exposed if ij was at risk of transmission and Dij = otherwise. If we 
randomly choose a pair ij at risk of transmission and look at the number of 
infectors to which j is exposed, its expected value is 

For s^^\(3,t) to be bounded away from zero, we must have 
(63) limsup — VY:j(0+)^ < oo. 

If not, the hazard of infection in j from a randomly chosen ij at risk of 
transmission becomes infinite as m — )• oo. Each susceptible will be infected 
at a contact interval approaching zero, so the contact interval distribution 
cannot be estimated. In practice, this means that large-sample distributions 
are useful when the number of pairs m and the number of susceptibles are 
both large and the largest value of Y.j{0^) <^ m. 

There is no similar constraint on the number of susceptibles exposed to 
each infectious person. In theory, we could have m susceptibles exposed to a 
single infectious person without violating the regularity conditions (as long 
as his or her infectious period was > T). This is because the contact intervals 
in all pairs ij for a fixed i are assumed to be independent of each other 
and independent of the infectious period of i conditional on the covariate 
processes Xij{T). 
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A. 2. Asymptotic properties of (7(/3o5 T")? /3; and Ao(t). LetUm{Po,T) 
denote the score process based on observations of m pairs ij at risk of 
transmission when who-infects-whom is observed, let f3m denote the corre- 
sponding (t) denote the 
corresponding Breslow estimate of the baseline hazard. Under the conditions 

of the last section, we have the following results as m — )■ oo (Andersen and 
Gill, 1982; Prentice and Self, 1983): 

1. Asymptotic normality of the score: -^U{Po,T) A^(0,ll). 

2. Consistency of /(/3o) and X(/3o): ^iWo) A S and ^X(/3o) A S. 

3. Consistency of /3: Prn — > /3o- 

4. Asymptotic normality of /3: y/rn{li — /3o) — > A(0, S^^) . 

5. Consistency of /(/3) and X(/3): ^/(/3) A S and ^X(^) A S. 

6. Convergence of \/m(Ao(T) — Ao(t)) to a mean-zero Gaussian process 
with independent increments. 

7. Asymptotic independence of ^^A(/3*, r)^ -y/m(/3— /3o) and m JJ" y(^p „)2 dA('u). 

8. Continuity of ^A(/3*,r): ^A(/3^,r) A ^A(/3o,r) if A /?. 

APPENDIX B: VARIANCE OF BASELINE HAZARD ESTIMATES 

Andersen and Gill (1982) showed that -y/m(Ao(T) — Ao(t)) converges to 
a mean-zero Gaussian martingale in the Cox model for standard survival 
data, and this result was extended to more general relative risk functions 
by Prentice and Self (1983). Under the conditions given in Appendix A, 
these derivations extend directly to infectious disease data. 

B. l. Who-infects-v^hom is observed. Expanding Ao(r)— Ao(r) gives 

us 

V^(Ao(r) - Ao(t)) = V^(a(/3, r) - A(/3o, r)) 

+ V^(A(/3o,T)-A5(r)) 
(64) +V^(A5(r)-Ao(T)), 

where ^q{t) = ly (u)>q^q['^) du. By a first-order Taylor expansion, the 
first term in (64) is 



(65) 
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for some f3* on the line segment between /3o and /3. Using the Doob-Meyer 
decomposition, the second term in (64) can be written 

which is a martingale with the optional variation process 
(67) m [ TFT^-^dNiu). 



Yi/3o,u) 

The third term in (64) is zero For all r such that Y{t) > 0. Under the 
regularity conditions of Appendix A, the first and second terms are asymp- 
totically independent, so the asymptotic variance of (64) is 



for all r such that Y{t) > 0. 

B.2. Who-infects-whom is not observed. By an expansion similar 
to that in equation (64), 



m 



V^(Ao{t) - Ao(r)) = V^(a^_5;(^, r) - A^ jiPo, r)) 

+ V^(a^3;(/3o, t) - A/3o,Ao(/3o, r)) 
+ \/m(A^,,Ao(/3o,r)-AS(r)) 
(69) +V^(AS(r)-Ao(r)). 

The fourth term in (69) is zero for all r at which Y{t) > 0. 
By a first-order Taylor expansion, the first term in (69) equals 

d 



(70) (^_A^3.(;3*,r)J V^(/3 - /3o) 

for some f3* on the line segment between Pq and /3, where 

Its contribution to the variance is 

(72) fl^A.r(/3o,r)Vfl7(/3o)') ' ( ^l~~ff],,r) ] . 
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The second term in (69) can be rewritten 
(73) r ^ ^ (diV(n|^, A) - diV(n|/3o, Aq)) 



^(/3o,^^) 

For each j, we have Jg°° dA^(u|/3,A) = 1 if j was infected and otherwise. 
Thus, the term in parentheses is the sum a subset of the random variables 
^ij = PijiPiX) — PijiPo, ^o), which have sum zero for each j. Since the 6ij 
are asymptoticahy independent for different j and Y{f3Q,u) = Op{m), the 
integral behaves asymptotically like a mean of independent random variables 
with mean zero and variance 0(/3 — /3o). Therefore, the second term of (69) 
is Op{(3 — I3q) and converges in probability to zero as m — )• oo. 

The third term in (69) can be evaluated using the conditional variance 
formula. The expression inside the parentheses has the variance 

(74) E;3o,Ao[<^v(/3o,r)]+Var;3„,Ao(Av(/3o,r)) = 

^ diV(n|/3o,Ao) + E;3o,Ao[Av(/3o,r)2] -A^,,Xoil3o,rf, 



Yi^o,uy 



where 



(75) .5(Ar) = £^diVHv). 

Since each infected person has only one infector and infectors can be chosen 
independently given the observed data, 



.[Av(A,r)^]=A«„.(^,.)^-|:(^'^<iiV, 



Eft,A,>[Av(A.r)^] =Aa„,.(f(.r)^-^( / ^77^ diV.jH A. Ao) " 

i= 

(76) + r —^dN{u\l3o,Xo), 

Jo y{po,uy 

where N.j{u\f3,X) = "^i^j Nij{u\l3, X). Therefore, the total variance contri- 
bution of the third term in (69) reduces to 

(77) 2 r diV(n|/3o,Ao)-X;f T tf^^ diV.,H/go. Ao)) • 

Since only the first and third terms of (69) are asymptotically nonzero, 
all that remains is to look at their covariance. Let A^ij(r|v) denote the value 
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of Nij{T) that we would have calculated had we observed the transmission 
network v. Then the corresponding value of the score U{f3, r) is 

and the corresponding covariance of U{f3^r) and A(^,r) is 

(-) -(^. ^) - E E I ^ (I >" -«("W. 

By the conditional covariance formula, 

Cov(^f7/3o,Ao(/3o, Ao), A/3o,Ao(/3o,t)) = Cov^o,Ao(fA^(/3o,T), Av(/3o,T-)) 

(80) +E;3o,Ao[^v(/3o,r)] 

By an argument similar to that leading to (77), this reduces to 

(81) -|;(^'^diV,W„.A„))f/,(A,r). 

In the limit of large m, both terms in (81) act like means of random variables 
with mean zero and finite variance, so they converge in probability to zero. 
Since /5 is a function of the expected score, this implies that the first and 
third terms of equation (69) are asymptotically independent. 

Combining all of these results, the asymptotic variance of (69) is 



2 
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Parameter: dini 





/3s US — 


/3pair — 


/3s US — 


/3pair — 1 


Baseline hazard 


Anf 


Anf 


Anf 


Anf 


a = .5 


.958 


.956 


.945 


.940 


a = 2 


.939 


.932 


.938 


.920 






Parameter: f3sus 






Anf = 


/5pair — 


Anf = 


/3pair — 1 


Baseline hazard 


/3sus 


/^sus 


/3sus 


/^SUS 


a = .5 


.945 


.943 


.946 


.952 


a = 2 


.918 


.921 


.932 


.932 






Parameter: /3pair 






Anf = 


Aus = 


Anf = 


/3sus — 1 


Baseline hazard 


^pair 


/3pair 


^pair 


^pair 


a = .5 


.949 


.928 


.952 


.940 


a = 2 


.950 


.941 


.951 


.934 



Table 1 

95% confidence interval coverage probabilities in simulations. Each probability is based on 

the results of 1000 simulations. 



Basolino hazard 


Q = 


.5 


a 


= 2 


(Juautilo 




Au(r) 


Au(r) 


Mr) 


10% 


.944 


.925 


.956 


.846 


25% 


.947 


.923 


.942 


.809 


50% 


.946 


.924 


.930 


.792 


75% 


.945 


.917 


.908 


.793 


90% 


.942 


.922 


.887 


.797 



Table 2 

95% confidence interval coverage probabilities tn simulations. Each probability is based on 

the results of 6000 simulations. 
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Estimated vs. true pin 



Estimated vs. true %, 




True Pin 




Estimated vs. true ppair 



Iterations 




-1.0 



a. d 



10 



15 



20 



Number of iterations 



Fig 1. The top two panels and the bottom left panel show P (black circles) and jS (gray 
circles) versus true /3 for j5inf, Psus, and Ppair- The bottom right panel shows a histogram 
of the number of EM iterations required for convergence. 



34 



E. KENAH 



True and estimated Ao(x): a = 0.5 



o o 




Contact interval (x) 



Fig 2. Ao(r) (black circles) and Ao(r) (gray circles) versus true Ao(r) for the 6000 simu- 
lations with a Weibull(0.5, 0.2 ) baseUne contact interval distribution. For each simulation, 
a circle is shown for the 10*, 25*'', SO"", 75*, and 90* percentiles of all possible contact 
intervals. The smoothed means were calculated using cubic smoothing splines. 
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Fig 3. Ao(r) (black circles) and Ao(r) (gray circles) versus true Ao(r) for the 6000 simu- 
lations with a Weibull(2, 0.6) baseline contact interval distribution. For each simulation, 
a circle is shown for the 10*, 25*'', SO"", 75*, and 90* percentiles of all possible contact 
intervals. The smoothed means were calculated using cubic smoothing splines. 
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versus Pj„, p,,, versus p, 




True Ppair Contact interval (x) 



Fig 4. The width of 95% confidence intervals for /Jsus, Ppair, and Ao(r) in terms 
of the confidence interval width of $inf, $aus, Ppair, and Ao(t). The solid gray lines show 
smoothed means for a = 0.5 and dashed gray lines show smoothed means for a = 2. The 
smoothed means were calculated using cubic smoothing splines. 
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Covariates 

Model ageint agCaus prophysus Interaction terms 



Uni variable 


1.53 (0.66, 3.54) 
p= .321 


0.41 (0.20, 0.85) 
p = .016 


0.43 (0.18, 1.02) 
p = .057 








Multivariable 


1.78 (0.69, 4.62) 
p = .234 


0.69 (0.29, 1.64) 
p = .399 


0.41 (0.17, 0.98) 
p = .046 








Multivariable 
+ interaction 


1.59 (0.32, 7.84) 
p = .570 


0.63 (0.14, 2.73) 
p = .532 


0.04 (0.00, 9.62) 
p = .253 


ageinf:agesus 
ageinfiprophsuB 
agesusiprophsus 
Likelihood ratio 


0.66 (p 
9.28 (p 
2.72 [p 
p = 


= .705) 
= .450) 
= .361) 
.101 


Stratified 


strata 


0.69 (0.29, 1.64) 
p = .401 


0.41 (0.17, 0.99) 
p = .047 








Stratified 
+ interaction 


strata 


0.52 (0.29, 1.64) 
p= .219 


0.23 (0.05, 1.16) 
p = .075 


ageinf:proplisuB 
Likelihood ratio 


2.37 (p 

p = 


= .379) 
.353 



Latent period 
1 day 

Infectious period 
5 days 



Sensitivity analysis (multivariable model without interaction) 



7 days 



1.44 (0.64, 3.26) 
p = .378 

1.59 (0.60, 4.20) 
p = .348 

1.45 (0.62, 3.40) 
p = .378 



0.83 (0.36, 1.93) 
p = .670 

0.64 (0.27, 1.55) 
p = .322 

0.89 (0.38, 2.04) 
p = .670 



0.35 (0.15, 0.80) 
p = .013 

0.45 (0.18, 1.07) 
p = .073 

0.34 (0.17, 0.87) 
p = .013 



Table 3 

Hazard ratios and p-values for different models of the 2009 pandemic influenza A(HINI) 
household surveillance data from Los Angeles County. The multivariable and stratified 
models without interaction were used as the final models. 
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Fig 5. Household transmission of 2009 pandemic influenza A(HINI) in Los Angeles 
County. Each panel shows separate curves for susceptible contacts with (gray lines) and 
without (black lines) antiviral prophylaxis. The solid lines are based on the multivariable 
model without interaction. The dotted lines are based on the model stratified by agei„f 
without interaction. 



